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Abstract — In this paper, a new automatic system for classifying 
ritual locations in diverse Hajj and Umrah video scenes is 
investigated. This challenging subject has mostly been ignored
in the past due to several problems, one of which is the lack of
realistic annotated video datasets. The HUER dataset is defined to
model six different Hajj and Umrah ritual locations [26].

The proposed Hajj and Umrah ritual location classifying 
system consists of four main phases: Preprocessing, segmenta- 
tion, feature extraction, and location classification phases. The 
shot boundary detection and background/foreground segmentation
algorithms are applied to prepare the input video scenes for
the KNN, ANN, and SVM classifiers. The system improves the
state-of-the-art results on Hajj and Umrah location classification
and successfully recognizes the six Hajj rituals with more than
90% accuracy. The demonstrated experiments show
promising results.

I. Introduction

During the last two decades, the field of visual recognition 
had an outstanding evolution from classifying instances of toy 
objects towards recognizing the classes of objects and scenes 
in natural images. Much of this progress has been sparked 
by the creation of realistic image datasets as well as by the 
new, robust methods for image description and classification. 
We take inspiration from this progress and aim to transfer
previous experience to the domain of video recognition, and to
the recognition of human actions in particular, for Hajj and
Umrah videos [10].

Action recognition from video shares common problems 
with object recognition in static images. Both tasks have to 
deal with significant intra-class variations, background clutter 
and occlusions. In the context of object recognition in static 
images, these problems are surprisingly well handled by a bag- 
of-features representation lISTl combined with state-of-the-art 
machine learning techniques like support vector machines. It 
remains, however, an open question whether and how these 
results generalize to the recognition of realistic human actions, 
e.g., in feature films or personal videos. 

Building on the recent experience with image classification, 
we employ spatio-temporal features and generalize spatial 
pyramids to spatio-temporal domain. This allows us to extend 
the spatio-temporal bag-of-features representation with weak 
geometry. We validate our approach on a standard benchmark
[16] and show that it outperforms the state of the art. We next
turn to the problem of action classification in realistic Hajj and
Umrah videos and show promising results for eight very challenging
action classes including walking, drinking Zamzam
water, sleeping, smiling, eating, praying, sitting, shaving hair,
doing ablution, reading the Holy Quran and making duaa.
Finally, we present and evaluate a fully automatic setup with
action learning and classification obtained for an automatically
labeled training dataset.

Fig. 1. The system model consists of four main phases: pre-processing,
segmentation, feature extraction, and location classification.



In [26], the authors proposed new event recognition datasets
for Hajj and Umrah videos. They presented two different
classes of datasets: 1) a ritual locations dataset, and 2) a human
event recognition dataset.

This paper is organized as follows. Section II presents
the system description and the proposed new model. Sections III
to VII describe the phases of the system, including the two
algorithms we developed for video pre-processing and
background/foreground segmentation. Simulation results are
presented in Section VIII. Section IX presents the related work
and background. Finally, the paper is concluded in Section X.

II. System Model and Description

In this section we define our research problem and propose 
a new solution model. 

Problem Definition: The problem can be defined as how
to detect the ritual locations (Tawaf, Sa'y, Arafat, Muzdalifah,
Mina, Jamarat). In particular, the proposed framework is
capable of classifying a wide range of locations
in diverse Hajj and Umrah video and image scenes under
different conditions. The main goal is to develop the proposed
system to detect the ritual locations, see Fig. 2.

Fig. 2. Hajj and Umrah Pilgrim Events Recognition Datasets. Various images
are taken from different places representing Hajj and Umrah rituals.



Proposed System Model: This paper presents a location 
classification system during Hajj and Umrah seasons in the 
two Holy cities of Makkah and Madina. The proposed system 
is composed of four phases as follows: 

1) Pre-processing phase, which segments the whole video
stream into small video scenes and captures the candidate
key frames from them.

2) Segmentation phase, which separates the foreground objects
from the background in Hajj and Umrah images and
videos.

3) Feature extraction phase, which defines the interest points
and their descriptions for the background and foreground
images.

4) Classification phase, which applies two types of classification
to the features resulting from the feature extraction phase:
pilgrim events for the foreground features, and
rite events for the background features.

These four main phases are described in detail in the
next four sections, along with the characteristic steps
involved in each phase.



Input: The input video streams
Output: Small video shots
foreach input video stream do
    - Capture frames from the input video.
    - Convert gradual transitions into cut transitions by
      using frame skipping k (k = 10).
    foreach frame do
        - Convert the frame from RGB to HSI color space
          using Equations (1), (2), and (3).
        - Calculate the motion difference between the current
          hue frame and the next hue k-frame.
        - Divide the original color frame into blocks of
          64 x 64 pixels.
        - Calculate the percentage of changed blocks
          between the current frame and the next k-frame.
        if (motion difference > Thr) and (block changing
        percentage > 0.25) then
            - Mark a new shot.
            - Select the k-frames using k = 10 frames.
        end
    end
end

Algorithm 1: A shot boundary detection algorithm (SBD-Alg)
for Hajj and Umrah videos



III. Pre-processing Phase 

The goal of the pre-processing phase is to segment the
video stream into video shots and select the k-frames. The
detection of the actual shot boundaries depends on
object and camera motion, measured by frame correlation. Each shot is
defined as a sequence of frames captured by a single camera
in a single continuous action in time and space [20]. The
correlation between frames in the same shot is a very important
indicator of their similarity; when there
is a noticeable change, we conclude that a new shot has started.

There are two types of transitions depending on the
camera movement: (1) instant (cut) transitions and (2) gradual
transitions. A transition produced by a special editing effect in the
movie is called a gradual transition, whereas an instant
transition has no editing effects and can be detected more accurately than
a gradual one. Most Hajj and Umrah videos have
gradual transitions, because these videos have been recorded
with digital cameras and mobile phones by amateur individuals,
or by a fixed El-Harram camera,
which needs permission from the Saudi authorities. The shot
boundary detection algorithm (SBD-Alg), Algorithm 1, is adapted in order
to generate video shots.
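
The two tests of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: frames are assumed to be already converted to 2-D grids of hue values in [0, 1], both tests are run on the hue frames (Algorithm 1 applies the block test to the original color frames), the circular nature of hue is ignored, and the function names and the thresholds `thr` and `block_thr` are hypothetical. Only k = 10, the 64 x 64 block size, and the 25% changed-block ratio come from Algorithm 1.

```python
from typing import List, Sequence

Frame = List[List[float]]  # assumed: 2-D grid of hue values in [0, 1]


def mean_abs_difference(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def changed_block_ratio(a: Frame, b: Frame, block: int, block_thr: float) -> float:
    """Fraction of block x block tiles whose mean difference exceeds block_thr."""
    rows, cols = len(a), len(a[0])
    changed = total = 0
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile_a = [row[c0:c0 + block] for row in a[r0:r0 + block]]
            tile_b = [row[c0:c0 + block] for row in b[r0:r0 + block]]
            total += 1
            if mean_abs_difference(tile_a, tile_b) > block_thr:
                changed += 1
    return changed / total


def detect_shot_boundaries(frames: Sequence[Frame], k: int = 10,
                           thr: float = 0.2, block: int = 64,
                           block_thr: float = 0.2,
                           min_changed: float = 0.25) -> List[int]:
    """Return the indices of the k-frames judged to start a new shot."""
    boundaries = []
    for i in range(0, len(frames) - k, k):
        cur, nxt = frames[i], frames[i + k]
        # Both conditions of Algorithm 1 must hold to mark a new shot.
        if (mean_abs_difference(cur, nxt) > thr
                and changed_block_ratio(cur, nxt, block, block_thr) > min_changed):
            boundaries.append(i + k)
    return boundaries
```

Comparing frame i with frame i + k, rather than with its immediate successor, is what turns a gradual transition into an apparent cut: a slow fade accumulates a large difference over k frames.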

The following equations describe the proposed algorithm.
These equations convert frames from RGB to HSI color
space [6].

$$H = \cos^{-1}\left(\frac{0.5\,\left[(R - G) + (R - B)\right]}{\sqrt{(R - G)^2 + (R - B)(G - B)}}\right), \tag{1}$$

$$S = 1 - \frac{3}{R + G + B}\,\min(R, G, B), \tag{2}$$

$$I = \frac{1}{3}(R + G + B), \tag{3}$$

where H, S and I are Hue, Saturation, Intensity, respectively;
and R, G, and B are the traditional Red, Green and Blue colors.
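
As a rough illustration, the RGB to HSI conversion can be written as a small Python function. It is a sketch under assumptions that the text does not state: the inputs are normalized to [0, 1], hue is returned in degrees, the hue is mirrored to 360 minus the angle when B > G (the usual HSI convention), and hue is set to 0 where it is undefined (black and gray pixels). The function name `rgb_to_hsi` is hypothetical.

```python
import math
from typing import Tuple


def rgb_to_hsi(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert RGB values in [0, 1] to (H in degrees, S, I)."""
    intensity = (r + g + b) / 3.0                    # Equation (3)
    if intensity == 0.0:
        return 0.0, 0.0, 0.0                         # black: H and S undefined
    # Equation (2): 1 - 3 * min(R, G, B) / (R + G + B)
    saturation = 1.0 - min(r, g, b) / intensity
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0.0:
        return 0.0, saturation, intensity            # gray: hue undefined
    # Equation (1); the ratio is clamped to guard against rounding error.
    theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
    hue = 360.0 - theta if b > g else theta          # assumed convention
    return hue, saturation, intensity
```

For example, pure red, green, and blue map to hues of about 0, 120, and 240 degrees, each with full saturation and an intensity of one third.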

IV. Segmentation Phase 

Separating foreground objects from natural images and
videos plays an important role in image and video editing
tasks. Despite extensive study in the last two decades, this
problem remains challenging. Segmenting spatio-temporal
video objects from a video sequence is even harder, since the
extracted foregrounds on adjacent frames must be both spatially
and temporally coherent. This section demonstrates efficient
foreground extraction methods and systems that combine
advanced computational algorithms.

A. Background Foreground Model 

The first step of background subtraction is to set up the
background model or the reference image. The background
is modeled in two distinct parts: a luminance model and a
color model. Input video streams or input images have three
channels with RGB components, but they are very sensitive
to noise and changes of lighting conditions. Therefore, we
use the luminance component of color images for the initial object
segmentation. The image luminance Y is calculated with the
following equation [11]:

$$Y = 0.299R + 0.587G + 0.114B. \tag{4}$$



However, the luminance component changes drastically
with shadows of objects in the background regions and the
reflection of lighting in the foreground regions. The HSI color
space (Hue, Saturation, Intensity) is often used because it
corresponds better to how people experience color than the
RGB color space does. As hue varies from 0 to 1.0, the corresponding
colors vary from red through yellow, green, cyan,
blue, and magenta, and back to red. Therefore, there are actually
red values both at 0 and at 1.0. As saturation varies from 0 to
1.0, the corresponding colors (hues) vary from unsaturated
(shades of gray) to fully saturated (no white component). As
value, or brightness, varies from 0 to 1.0, the corresponding
colors become increasingly brighter. Equations (1), (2), and (3)
define the HSI color space [6].



Input: Video stream
Output: Background and foreground image sequences
        from the input video stream
foreach input video stream do
    Capture frames from the input video.
    foreach frame do
        Convert the frame from RGB to HSI color space
        using Equations (1), (2), and (3) [6].
    end
    foreach color component (H, S, and I) in HSI color space do
        Compute the μ and σ parameters of the
        background statistical model for all pixels from
        the first N frames.
    end
    foreach frame do
        foreach pixel do
            Calculate the background pixel distribution
            P(x) using the Gaussian normal distribution as
            shown in Equation (5).
        end
        Calculate the correlation distance measure
        between the current frame and the background
        model or reference image as shown in Equation (6).
        if Corr(x, y) < Thr then
            x belongs to the background pixels.
        else
            x belongs to the foreground pixels.
        end
    end
    Update the background statistical model parameters μ and σ.
end

Algorithm 2: Video background and foreground segmentation



The background subtraction algorithm is summarized
in Algorithm 2 [11].

The following equation describes the background
pixel distribution:

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \tag{5}$$

where $P(x)$ is the probability distribution function, $\mu$ is
the mean, and $\sigma$ is the standard deviation.
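
A minimal sketch of the per-pixel background statistics may help here. It assumes that one color component of one pixel is observed over the first N frames and that the population standard deviation is used; the function names and the small floor on the standard deviation (to avoid division by zero for a perfectly constant pixel) are assumptions of this sketch, not part of the described system.

```python
import math
from typing import Sequence, Tuple


def fit_background(samples: Sequence[float]) -> Tuple[float, float]:
    """Estimate (mu, sigma) of one pixel component from the first N frames."""
    n = len(samples)
    mu = sum(samples) / n
    variance = sum((s - mu) ** 2 for s in samples) / n
    return mu, max(math.sqrt(variance), 1e-6)   # floor is an assumption


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Equation (5): normal density of value x under the background model."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / norm
```

A pixel value close to the learned mean receives a high density and is likely background, while a value several standard deviations away receives a density close to zero.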



B. Foreground Segmentation 

The background subtraction is based on the varying features
of H, S, and I for each pixel in the background. The current frame
is then subtracted from the background model
or reference image to obtain the moving video objects, that is,
the foreground image. To achieve this, we use the
Gaussian normal distribution as a statistical model of
the distribution of each color component of the background
pixels. The background model therefore consists of the mean
and standard deviation of the color components of the background
pixels. By detecting the variation of the pixels with respect to
the background model, the video objects can be segmented.

The correlation distance measure is described in the following
equation:

$$\mathrm{Corr}(x, y) = 1 - \frac{(x - \bar{x}) \cdot (y - \bar{y})}{\lVert x - \bar{x} \rVert \, \lVert y - \bar{y} \rVert}, \tag{6}$$

where $\bar{x}$ is the mean of the current image, and $\bar{y}$ is the
mean of the reference image.
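
The correlation distance of Equation (6) and the background test of Algorithm 2 can be sketched as follows. The inputs are assumed to be flattened pixel vectors of equal length; the function names, the threshold value, and the handling of constant vectors (where the correlation is undefined) are assumptions of this sketch.

```python
import math
from typing import Sequence


def correlation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (6): one minus the correlation of two equally long vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = math.sqrt(sum(v * v for v in dx)) * math.sqrt(sum(v * v for v in dy))
    if norm == 0.0:
        # Undefined for a constant vector; treated as uncorrelated here.
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(dx, dy)) / norm


def is_background(current: Sequence[float], reference: Sequence[float],
                  thr: float = 0.3) -> bool:
    """Background when the distance to the reference is below thr (assumed value)."""
    return correlation_distance(current, reference) < thr
```

The distance is 0 for perfectly correlated vectors, 1 for uncorrelated ones, and 2 for perfectly anti-correlated ones, so a small threshold keeps only regions that closely follow the reference image.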



V. Feature Extraction Phase 
In this section, we split feature extraction into two stages:
1) Feature detectors: The resulting features are subsets
of the image domain, often in the form of isolated points,
continuous curves, or connected regions. Feature detection
finds interesting points (features) in the
image, for example, a corner or a template.

2) Feature descriptors: A descriptor represents the interest points we
found so that they can be compared with other interest features in
the image, for example, by the local intensity of the area around the
point or by the local orientation of the area around the point.

Most feature detectors involve the computation of derivatives
or more complex measures, such as the second moment
matrix for the Harris detector or entropy for the salient regions
detector. This step needs to be repeated for each and
every location in the feature coordinate space, which includes
position, scale, and shape. This makes the feature extraction
process computationally expensive and therefore unsuitable for
many applications.

We describe several feature detectors that have been developed
with computational efficiency as one of the main
objectives. SIFT uses the Difference of Gaussian (DoG)
detector, which approximates the Laplacian using multiple scale-space
pyramids. SURF uses integral images to efficiently
compute a rough approximation of the Hessian matrix. FAST
evaluates only a limited number of individual pixel intensities
using decision trees.

The proposed system models the Hajj and Umrah
events through the information obtained by tracking
the feature points. We rely on the motion vectors of
feature points computed over multiple frames. After the pre-processing
phase, we have two separate sequences of video
frames, background and foreground. In the proposed system,
we use a modified scale-invariant feature transform (SIFT)
algorithm, which identifies the distinct features of an image.
These features can be used to identify similar or identical
objects in other images, as shown in Algorithm 3.

A. Scale-Invariant Feature Transform ( SIFT) 

SIFT consists of four major stages: scale-space extrema
detection, keypoint localization, orientation assignment,
and keypoint description. The first stage uses the difference-of-Gaussian
function to identify potential interest points that
are invariant to scale and orientation. DoG is used instead
of the Laplacian of Gaussian to improve the computation speed. The Gaussian
image pyramid L(x, y, σ) is generated by successively
filtering the image I(x, y) with the Gaussian filter G(x, y, σ)
according to Equation (7), and adjacent Gaussian
images are subtracted to produce the difference-of-Gaussian
images (DoG) as in Equation (9), which approximate the
Laplacian of a Gaussian filtering ITSl . 

In the keypoint localization step, to localize the keypoints more
accurately and to remove the keypoints
with low contrast, we discard the keypoints at which the
absolute value of the DoG pyramid at the interval where they are
detected is smaller than a certain threshold. A keypoint must also
not lie on a strong edge. For this reason, we use the
discrete differences between neighboring pixels around the
keypoints to calculate the Hessian matrix. The determinant of the
Hessian is used to compute the principal curvatures of the detected
interest points and to eliminate the keypoints that have a
ratio between the principal curvatures lower than Thr.

Input: The background or foreground image sequence
Output: Features and labels for each new location
foreach background image do
    - Build a Gaussian image pyramid L(x, y, σ) using
      Equations (7), (8), and (9).
    - Calculate the Hessian matrix as in Equation (10).
    - Calculate the determinant of the Hessian matrix as
      in Equation (11) and eliminate the weak keypoints.
    - Calculate the gradient magnitude and orientation as
      in Equations (12) and (13).
    - Apply the sparse coding feature based on SIFT
      descriptors as in Equations (14) and (15).
end

Algorithm 3: Feature extraction using SIFT

An orientation histogram is formed from the gradient
orientations of sample points within a region around the
keypoint. Each sample added to the histogram is weighted by
its gradient magnitude and by a Gaussian-weighted circular
window. We assign only one orientation to each keypoint,
which corresponds to the peak of the histogram. According
to the experiments, the best results were achieved by a 4 x 4
array of histograms with 8 orientation bins in each, so the
SIFT descriptor that was used is a 4 x 4 x 8 = 128 dimension
vector [13]. After that, we apply sparse coding based on the
SIFT features. Sparse coding represents input vectors
approximately as a weighted linear combination of basis
vectors, which capture the high-level patterns in the input
data as shown in Algorithm Q ifTSll . 
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \tag{7}$$

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right), \tag{8}$$

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma), \tag{9}$$

where $\sigma$ is the scale parameter, $G(x, y, \sigma)$ is the Gaussian filter,
$I(x, y)$ is the input image, $L(x, y, \sigma)$ is the Gaussian pyramid, and
$D(x, y, \sigma)$ is the difference of Gaussian (DoG) [24].
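
To make Equations (7) and (9) concrete, the following sketch builds one difference-of-Gaussian image in plain Python. It is an illustration only: the image is a small nested list of floats, the Gaussian is sampled out to three standard deviations and normalized, border pixels are replicated, and the scale ratio k defaults to the square root of 2. These choices, and all function names, are assumptions of the sketch rather than details given in the text.

```python
import math
from typing import List

Image = List[List[float]]


def gaussian_kernel(sigma: float) -> List[float]:
    """Sampled, normalized 1-D Gaussian (the 2-D Gaussian filter is separable)."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image: Image, sigma: float) -> Image:
    """L = G * I (Equation (7)), with edge pixels replicated at the border."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    rows, cols = len(image), len(image[0])

    def clamp(i: int, n: int) -> int:
        return min(max(i, 0), n - 1)

    # Horizontal pass followed by a vertical pass.
    horizontal = [[sum(kernel[j + radius] * image[r][clamp(c + j, cols)]
                       for j in range(-radius, radius + 1))
                   for c in range(cols)] for r in range(rows)]
    return [[sum(kernel[j + radius] * horizontal[clamp(r + j, rows)][c]
                 for j in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def difference_of_gaussians(image: Image, sigma: float,
                            k: float = math.sqrt(2.0)) -> Image:
    """D = L(k * sigma) - L(sigma) (Equation (9))."""
    coarse, fine = blur(image, k * sigma), blur(image, sigma)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(coarse, fine)]
```

On a flat image the two blurred versions are identical, so the DoG response is zero everywhere; the response is non-zero only around structures such as blobs and corners, which is why its extrema are used as candidate keypoints.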



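$$H = \begin{bmatrix} I_{xx}(x, \sigma) & I_{xy}(x, \sigma) \\ I_{xy}(x, \sigma) & I_{yy}(x, \sigma) \end{bmatrix}, \tag{10}$$

where $I_{xx}$, $I_{xy}$, and $I_{yy}$ are the second-order Gaussian-smoothed image
derivatives, which detect signal changes in two orthogonal
directions.

$$\mathrm{Det}(H) = I_{xx}(x, \sigma)\, I_{yy}(x, \sigma) - \left(I_{xy}(x, \sigma)\right)^2. \tag{11}$$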
$$\mathrm{Mag}(x, y) = \left( \left(I(x+1, y) - I(x-1, y)\right)^2 + \left(I(x, y+1) - I(x, y-1)\right)^2 \right)^{1/2}. \tag{12}$$
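
The gradient magnitude of Equation (12) uses central differences between neighboring pixels. The sketch below computes it together with a gradient angle; the angle formula (`atan2` of the vertical and horizontal differences) is the standard SIFT choice and is an assumption here, since the orientation equation is not legible in this copy. The `image[y][x]` indexing and the function name are also assumptions.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def gradient_at(image: Image, x: int, y: int) -> Tuple[float, float]:
    """Central-difference gradient magnitude (Equation (12)) and angle at (x, y)."""
    dx = image[y][x + 1] - image[y][x - 1]
    dy = image[y + 1][x] - image[y - 1][x]
    return math.hypot(dx, dy), math.atan2(dy, dx)
```

For an image whose intensity grows by 2 per pixel along x, the function returns a magnitude of 4 and an angle of 0 at any interior pixel.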
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(14) 



(15) 



Where Xi is the SIFT descriptors feature, is mostly zero 
(sparse), (f> is the basis of sparse coding, A is the weights 
vector. 



VI. Location Classification Phase

Event classification is a machine learning technique used to
predict group membership for data instances. A supervised
classification method uses a set of features or parameters to
characterize each object. We
prepare the classes and provide a set of samples. In the training
phase, the training set is used to decide how the parameters
ought to be weighted and combined in order to separate the
various classes of objects. In the testing or application phase,
the weights determined from the training set are applied to a set
of objects that do not have known classes in order to determine
what their classes are likely to be. Three kinds of
classification methods are used in the proposed system: the
k-nearest neighbor (KNN) classifier, the artificial neural network
(ANN), and the support vector machine (SVM) classifier; we briefly
give an overview of each of them in the following subsections.

A. K-Nearest Neighbor (K-NN)

The k-nearest-neighbor is a very simple classifier based on
the nearest-neighbor approach. In this method, one simply
finds in the N-dimensional feature space the closest object
from the training set to an object being classified. Since the
neighbor is nearby, it is likely to be similar to the object being
classified and so is likely to be the same class as that object.

In this classification algorithm, a new object is classified
based on the majority category of its K nearest neighbors (K is a
predefined integer). Given a query point, the algorithm finds
the K objects or training points closest to the query
point, using the minimum distance from
the query to the training samples to determine the K
nearest neighbors. A simple majority of these K nearest neighbors
is then used as
the prediction for the query instance IH. We use the Euclidean
distance as the distance function, as shown in Equation (16):

$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \tag{16}$$

where $D(x, y)$ is the distance function, $x$ is the query sample,
$y$ is the sample from the training set, and $n$ is the feature dimension.
An advantage of K-nearest neighbors is its robustness to noisy
training data, especially when votes are weighted by the inverse
square of the distance. Its disadvantages are the need to determine
the value of the parameter K (the number of nearest neighbors) and
a quite high computation cost, because the distance from each query
instance to all training samples must be computed.

B. Artificial Neural Network (ANN)

The Artificial Neural Network (ANN) is an information
processing paradigm that is inspired by the way biological
nervous systems, such as the brain, process information. It
is composed of a large number of highly interconnected
processing elements (neurons) working in unison to solve
specific problems. The functions of a biological neuron are
modeled by computing a differentiable nonlinear function
(such as a sigmoid) for each artificial neuron [3].

C. Support Vector Machine (SVM)

The support vector machine (SVM) algorithm seeks to
maximize the margin around a hyperplane that separates a
positive class from a negative class. We are given a training dataset
with n samples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is a
feature vector in a v-dimensional feature space with label
$y_i \in \{-1, 1\}$, belonging to either of two linearly separable
classes $C_1$ and $C_2$. Geometrically, the SVM modeling algorithm
finds an optimal hyperplane with the maximal margin to
separate the two classes, which requires solving the optimization
problem shown in Equations (17) and (18) [23].

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \tag{17}$$

$$\text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \tag{18}$$

where $\alpha_i$ is the weight assigned to the training sample $x_i$.
If $\alpha_i > 0$, $x_i$ is called a support vector. $C$ is a regularization
parameter used to trade off the training accuracy and the
model complexity so that a superior generalization capability
can be achieved. $K$ is a kernel function, which is used to
measure the similarity between two samples. Different choices
of kernel functions have been proposed and extensively used
in the past; the most popular are the Gaussian radial basis
function (RBF), the polynomial of a given degree, and the multilayer
perceptron. These kernels are in general used, independently
of the problem, for both discrete and continuous data.
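
Of the three classifiers, the K-nearest-neighbor rule is the simplest to illustrate. The sketch below combines the Euclidean distance of Equation (16) with a plain majority vote. The feature vectors, the labels, and the function names are made up for illustration, the vote is unweighted, and ties are broken by the order in which labels are encountered; none of these details are specified in the text.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (16): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label the query by a majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Every prediction computes the distance to all training samples, which is the computation cost noted for this classifier.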

VII. Hajj and Umrah Rituals Classifications 

There are six rite locations during Hajj and Umrah:
Tawaf, Sa'y between Safa and Marwa, standing on Mount
Arafat, staying overnight in Muzdalifah, staying overnight in
Mina, and throwing Jamarat. The models defined for this study
are described below:

1) Tawaf: Tawaf is one of the Islamic rituals of pilgrimage. During
the Hajj and Umrah, Muslims circumambulate
the Kaaba (the most sacred site in Islam) seven times, in a
counterclockwise direction, see Fig. 3.

Fig. 3. Examples of Tawaf around the Holy Kaaba

2) Sa'y: Pilgrims, whether they are performing Hajj or
Umrah, perform Sa'y after Tawaf. Sa'y means endeavoring
or making an effort. For Hajj, this is held to commemorate
Hagar's running between Safa and Marwa seven times
in order to find water for her son, Ishmael, whom she
was still breast-feeding, see Fig. 4.

3) Arafat: On the plain of Arafat and Mount Arafat in
Saudi Arabia, about three million pilgrims congregate
to perform the most important rite of Hajj, or the
pilgrimage. This site is significant because it is on the
Mount of Mercy that the Prophet Muhammad gave his
final sermon. Many pilgrims climb the hill and try to
touch the pillar that marks this place. After Arafat,
pilgrims move to Muzdalifah to complete the
remaining rites of the pilgrimage, see Fig. 5.

4) Muzdalifah: It is obligatory for the one performing the
Hajj to stay in Muzdalifah on the tenth (10th) of
Dhul-Hijjah until the time of Fajr prayer, see Fig. 6.

Fig. 6. Examples of Muzdalifah

5) Mina: Mina, seven kilometres east of the Masjid El-Harram,
is where Hajj pilgrims sleep overnight on the 8th,
11th, 12th (and some even on the 13th) of Dhul-Hijjah.
It contains the Jamarat, the three stone pillars which are
pelted by pilgrims as part of the rituals of Hajj, see Fig. 7.

Fig. 7. Examples of Mina

6) Jamarat: The pilgrim who throws Jamarat before Zawal
on the eleventh day and the following days will have
to throw the pebbles again after Zawal if the days of
throwing the pebbles have not yet expired, see Fig. 8.

VIII. Experimental Results and Analysis 

Fig. 8. Examples of the Jamarat

The proposed system is evaluated using many videos from the
Hajj and Umrah datasets [26]. There are two categories for
the quality of the recorded videos:
(a) High resolution, with a frame size of 1280 x 720 pixels.
(b) Low resolution, with a frame size of 640 x 480 pixels.
During the experiments, both quality categories are covered. All
Hajj and Umrah videos are in Audio Video Interleave (AVI)
format with a frame rate of 30 fps.

Two indicators, namely recall and precision, are
calculated for each phase of the proposed system
in order to evaluate its performance
and measure the resulting accuracy at each phase.
Recall and precision ratios are the basic measures used in
evaluating search strategies. Recall is the ratio of the number
of relevant records retrieved to the total number of relevant
records in the database. Precision is defined as the ratio of
the number of relevant records retrieved to the total number
of irrelevant and relevant records retrieved. Both recall and
precision are usually expressed as percentages [25]. Equations
(19) and (20) describe the calculation of the recall and precision
ratios, respectively.

$$\mathrm{Recall} = \frac{t_p}{t_p + f_n}, \tag{19}$$

$$\mathrm{Precision} = \frac{t_p}{t_p + f_p}, \tag{20}$$

where $t_p$, $f_p$, and $f_n$ represent the true positive, false
positive, and false negative samples, respectively.

The results of the shot boundary detection algorithm are
shown in Table I. For the collection of Hajj and Umrah videos,
the algorithm correctly identified 281 shot boundaries,
while 7 shot boundaries were missed, and in 11 cases a
false shot was identified. Accordingly, the average recall and
precision were 97.53% and 96.42%, respectively, as shown in
Fig. 9.

Table II illustrates the results of rite event classification using
the KNN-based, ANN-based, and SVM-based classifiers. The SVM
classifier attained the highest accuracy of the three
classifiers, 95.33%, as shown
in Fig. 10.

Fig. 9. Shot detection results as explained in Table I.

Fig. 10. Ritual classification results as explained in Table II.
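
The recall and precision of Equations (19) and (20) can be checked against the per-class counts of Table I with a short script. The two zero counts (false detections for Muzdalifah, misses for Mina) are not printed in this copy of the table and are inferred here from the corresponding 100% entries. Averaging the six per-class values appears to reproduce the reported averages of 97.53% and 96.42% to within rounding, which suggests, but does not confirm, that they are per-class means. All names in the sketch are hypothetical.

```python
from typing import Dict, Tuple

# (correct, false, miss) per ritual location, taken from Table I.
# The zeros for Muzdalifah (false) and Mina (miss) are inferred from
# the 100% precision and recall entries of those classes.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "Tawaf": (83, 3, 2),
    "Sa'y": (44, 1, 2),
    "Arafat": (55, 2, 1),
    "Muzdalifah": (23, 0, 1),
    "Mina": (30, 2, 0),
    "Jamarat": (46, 3, 1),
}


def recall(tp: int, fn: int) -> float:
    """Equation (19)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Equation (20)."""
    return tp / (tp + fp)


recalls = {name: recall(tp, fn) for name, (tp, fp, fn) in COUNTS.items()}
precisions = {name: precision(tp, fp) for name, (tp, fp, fn) in COUNTS.items()}

# Mean of the per-class values, expressed as percentages.
macro_recall = 100.0 * sum(recalls.values()) / len(recalls)
macro_precision = 100.0 * sum(precisions.values()) / len(precisions)
```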



IX. Background & Related Work 

Our script-based annotation of human actions is similar 
in spirit to several recent papers using textual information 
for automatic image collection from the web lfT2l . ifTsl and 
automatic naming of characters in images JT] and videos 
(|5|. Differently to this work we use more sophisticated text 
classification tools to overcome action variability in text. 

Discriminative interest points can be selected using 
boosting IHl, |fT9l . Given a large number of training samples, 
boosting can select the most discriminative and representative 
interest points from a set of randomly generated cuboids. 
However, boosting usually requires a huge amount of training 
samples, making it less applicable to small motion datasets. 

Efros et. al. jjU proposed an approach to recognize hu- 
man actions at low resolutions which consisted of a motion 
descriptor based on smoothed and aggregated optical flow 
measurements over a spatio-temporal volume centered on a 
moving figure. This spatial arrangement of blurred channels 
of optical flow vectors is treated as a template to be matched 
via a spatio-temporal cross correlation against a database of 
labeled example actions. 

Bobick et al. [2] computed Hu moments of motion energy
images and motion-history images to create action templates
based on a set of training examples which were represented by
the mean and covariance matrix of the moments. Recognition
was performed using the Mahalanobis distance between the
moment description of the input and each of the known
actions.

TABLE I
Results of the shot detection algorithm for 292 total experimental images. Among 86 images, there are 83 correct Tawaf images, 2
Tawaf images which are not discovered, and 3 images from other classes which are assigned to Tawaf.

Event       Tawaf   Sa'y    Arafat  Muzdalifah  Mina    Jamarat
Total       86      45      57      23          32      49
Correct     83      44      55      23          30      46
False       3       1       2       0           2       3
Miss        2       2       1       1           0       1
Precision   96.5%   97.8%   96.5%   100%        93.8%   93.9%
Recall      97.6%   95.7%   98.2%   95.8%       100%    97.9%


Shechtman and Irani ifTSll avoid explicit flow computations 
by employing a rank-based constraint directly on the intensity 
information of spatio-temporal cuboids to enforce consistency 
between a template and a target. Given one example of an 
action, spatio-temporal patches are correlated against a testing 
video sequence. Detections are considered to be those loca- 
tions in space-time which produce the most motion consistent 
alignments. 

Similar to ours, several recent methods explore bag-of- 
features representations for action recognition fTJ, ll22l . but 
only address human actions in controlled and simplified 
settings. Recognition and localization of actions in movies 
has been recently addressed in ifTTI for a limited dataset, for 
example, the manual annotation of two action classes. 

Nelson [14] developed methods for recognizing human
motions by obtaining spatio-temporal templates of motion and
periodicity features from a set of optical flow frames. These
templates were then used to match the test samples with the
reference motion templates of known activities.

X. Conclusion 

In this paper, we proposed a new location-based event
classification system for Hajj and Umrah video and
image scenes. We applied the SIFT and sparse coding
algorithms to local features to keep the relationship between
these local features. After that, we used event recognition
classifiers to identify the location among the six rituals (Tawaf, Sa'y,
Arafat, Muzdalifah, Mina, and Jamarat). We used three types
of classifiers (ANN, KNN, and SVM), and the SVM
classifier achieved the best results.

In our future work, as the Hajj and Umrah (HUER) dataset
is limited, we will extend it with additional images and
videos and evaluate our system on them. In addition,
we will apply other feature extraction algorithms, such as
SURF and FAST, and compare the new results with those of the SIFT
method. We also expect that the
artificial neural networks can be improved by using feed-backward
networks and radial basis functions.